Abstract

This work introduces a novel monotonic analysis to determine whether
proposed image quality (IQ) measures are consistent with human-measured
perceptual quality scores. Specifically, the analysis performs a
generalized likelihood ratio test of the $H_1$ hypothesis that the IQ
measures and the corresponding perceptual measurements are related via a
monotonic function against the null hypothesis that the functional
relationship is arbitrary. This paper evaluates six proposed IQ measures
against mean opinion scores using the new monotonic analysis.

1  Introduction 

The next generation of night vision goggles and night scopes will fuse
image intensified (I2) and long wave infrared (LWIR) imagery to create a
hybrid image that will enable soldiers to better interpret their
surroundings during nighttime missions. The key to such systems is the
determination of the best image fusion algorithm for a specific task. A
number of image fusion algorithms have been proposed in the literature,
e.g., (Zhang and Blum 1999). Currently, a scientific evaluation of such
algorithms requires extensive and expensive human perception studies to
determine how well soldiers can perform a specific task. What is needed
is an image quality (IQ) measure that can automatically quantify the
utility of image fusion algorithms.

The ultimate goal is an image model that is able to predict human
performance given a few IQ measures as input parameters. This paper
demonstrates the monotonic correlation as a tool to score the myriad of
measures based upon how well an arbitrary monotonic curve is able to fit
the relationship between computed IQ features and human performance.
Previous work investigated the monotonic correlation for the human task
of classification (Kaplan et al. 2008b) when fusing the I2 and LWIR
bands. This paper uses the experimental results from (Chen and Blum
2008), which also considers the fusion of the I2 and LWIR bands. In that
work, multiple humans scored imagery resulting from six common fusion
algorithms based on perceived quality over 28 different scenes (a total
of 168 images). Furthermore, six full-reference IQ measures were
calculated over the 168 images.

This paper is organized as follows. Section 2 details the perceptual
experiments, including the image fusion algorithms, IQ evaluation, and
human perceptual scoring that were used. Then, Section 3 introduces the
tools for monotonic analysis. These tools are used to evaluate proposed
IQ measures in Section 4. Finally, Section 5 provides concluding
remarks.

2  Perceptual  Experiment 

The perceptual experiment consisted of applying image fusion algorithms
to registered I2 and LWIR images, calculating various IQ measures, and
measuring human preference over the fused images. The details are
provided in the following subsections.

2.1  Image  Fusion  Algorithms 

The image fusion algorithm takes input from a number of source images
and generates a single fused image that is presented to the human user
for interpretation. Image fusion has a number of applications including
remote sensing, concealed weapon detection, and night vision (Simone
et al. 2002; Chen et al. 2005; Blum and Liu 2006). Two main classes of
fusion algorithms exist. The first generates a gray scale image by
determining which information to include from the various source images
(see (Zhang and Blum 1999) for an excellent review of such methods). The
second class generates a color image by mapping different source images
into different color spaces, e.g., (Waxman et al. 1997). This class of
methods is only appropriate when three or fewer sources are used. This
work only considers gray scale fusion.

For the experiments, six gray scale image fusion methods were
implemented: 1) simple pixel averaging (Bender et al. 2003), 2) discrete
wavelet transform (DWT) (Huntsberger and Jawerth 1993), 3)
Filter-Subtract-Decimate pyramid (FSD) (Anderson 1988), 4) Laplacian
pyramid (LAP) (Burt and Adelson 1983), 5) Morphological pyramid (Morph)
(Toet 1989), and 6) Shift invariant DWT (SiDWT) (Rockinger 1997). Note
that only the fused images, and not the original I2 and LWIR channels,
were evaluated in the human perception experiments. The details of the
exact implementation of the fusion methods are available in (Chen and
Blum 2008). Figure 1 shows examples of imagery generated by these six
fusion methods.

2.2  Image  Quality  Measures 

A full reference IQ measure quantifies the similarity of a processed
image against the original image (or images). A number of such IQ
measures have been proposed to evaluate image compression algorithms.
The typical metrics used to evaluate compressed imagery, e.g., mean
squared error (MSE) and peak signal to noise ratio (PSNR), are known to
be poor IQ metrics, and more relevant metrics are described and
evaluated in (Wang et al. 2004) for the application of image
compression. These measures are easily adapted for image fusion
algorithms, where the fused IQ is the weighted average of the IQ measure
between the fused image and each of the source images (Piella 2004;
Xydeas and Petrovic 2000; Qu et al. 2002; Chen and Varshney 2007; Wang
et al. 2004; Chen and Blum 2008). In effect, these full reference
features quantify how well "salient features" in the fused imagery match
the "salient features" in the source images. Table 1 summarizes the six
potential full reference IQ measures considered in this work and points
to appropriate references.

2.3  Perceptual  Image  Evaluation 

The perceptual evaluation was performed over 28 scenes consisting of
co-registered I2 and LWIR imagery. Specifically, these scenes were
processed using the six fusion algorithms described in Section 2.1.
Figure 1 shows an example of one of the scenes. The source I2 and LWIR
images are provided in Figures 1(a) and (b), respectively. The I2 image
provides finer resolution, more texture, and better context than the
LWIR image. On the other hand, the contrast is better in the LWIR image.
In fact, the human is only visible in the LWIR image. Figures 1(c)-(h)
provide the corresponding fused images. All the fused images contain the
signature of the human. The pixel averaging provided the poorest
contrast. In terms of contrast and texture, one could argue that the DWT
and Morph images are slightly better than the FSD, LAP, and SiDWT
images.

Human observers provided opinion scores ranging from 1 (worst quality)
to 10 (best quality) for each of the 168 fused images. The mean opinion
scores (MOS) and associated sample variances are provided in (Chen and
Blum 2008) for each fusion method and each scene. In this work, the MOS
represents the perceptual score of the images.

3  Monotonic  Analysis 

A potential IQ measure is simply a deterministic mapping of an image
into a scalar that quantifies how well the image actually portrays the
scene. For image fusion applications, the IQ measure indicates how well
the relevant details in the source images are preserved in the fused
image. On the other hand, the perceptual scores can be viewed as noisy
measurements. A repeat of the perception experiments with the same
imagery should lead to similar but not identical results. Thus, it is
reasonable to model the perceptual scores as the nominal result embedded
in noise.

The actual individual preference for a given image can be biased by the
content in the scene. In other words, it is possible for one fusion
method to generate a desirable image for one scene, but not for another.
As a result, the relative rankings of the utility of the different image
fusion algorithms can change from scene to scene. A desirable IQ measure
should track the relative rankings over the various scenes. This section
details a hypothesis test to determine whether or not the proposed IQ
measure and the perception results demonstrate a consistent monotonic
relationship over the scenes under test. First, the data models for the
IQ and perception results are provided. Next, the concept of monotonic
correlation is introduced. Finally, the relationship between the
monotonic correlation and a generalized likelihood ratio test is shown.

3.1  Data  Models 

Given that $N_f$ fusion algorithms are under consideration, let the
$N_f \times 1$ vector $\mathbf{x}$ represent a given IQ measure
evaluated over the $N_f$ fused images associated with a given scene.
Likewise, let the $N_f \times 1$ vector $\mathbf{y}$ represent the MOS
values collected over the same $N_f$ fused images. The pair $(x_i, y_i)$
represents the IQ measure and MOS value for the $i$-th fused image. The
vector $\mathbf{x}$ is deterministic because it represents IQ results.
On the other hand, we model the MOS value for the $i$-th fusion method
as

$$y_i = \mu_i + n_i, \tag{1}$$

where $n_i \sim N(0, \sigma_n^2)$ due to the central limit theorem. The
mean value $\mu_i$ is taken to be the sample mean of the opinion scores
tabulated over the $i$-th fused image. The variance of the measurement
noise $\sigma_n^2$ is taken to be the sample variance over all opinion
scores for the scene divided by $N_f$.

A statistic to evaluate the usefulness of the proposed IQ measure that
generates the vector $\mathbf{x}$ must quantify how well the pairs
$(\mathbf{x}, \mathbf{y})$ support the hypothesis that there exists an
arbitrary monotonic function $h_{\text{mono}}(\cdot)$ such that
$\mu_i = h_{\text{mono}}(x_i)$. Equivalently, the monotonic hypothesis
indicates that either $x_i > x_k$ implies $\mu_i > \mu_k$
(monotonically increasing) or $x_i > x_k$ implies $\mu_i < \mu_k$
(monotonically decreasing). In reality, a proper IQ measure should
exhibit a monotonically increasing relationship with human performance.
However, if the relationship is monotonically decreasing, the proposed
feature can trivially be transformed into a proper IQ measure via a
negative or reciprocal operation. Because $\mathbf{x}$ and $\mathbf{y}$
are the input and noisy output values of the function
$h_{\text{mono}}(\cdot)$, respectively, we refer to $\mathbf{x}$ and
$\mathbf{y}$ as the input and output vectors, respectively, in the
sequel.

| No. | Measure | Description | Reference |
|-----|---------|-------------|-----------|
| 1 | Universal Quality Index (UI) | Average Structure SIMilarity (SSIM) index between fused and reference images | (Wang et al. 2004) |
| 2 | Information Measures (MI) | Average mutual information between fused and reference images (bin size = 16) | (Qu et al. 2002) |
| 3 | Objective Measure (QE) | Average objective edge information between fused and reference images | (Xydeas and Petrovic 2000) |
| 4 | Mannos Quality Index (Qm) | HVS quality index using the Mannos & Sakrison contrast sensitivity filter | (Chen and Blum 2008) |
| 5 | Barton Quality Index (Qb) | HVS quality index using the Barton contrast sensitivity filter | (Chen and Blum 2008) |
| 6 | Difference Quality Index (Qd) | HVS quality index using the difference of Gaussian contrast sensitivity filter | (Chen and Blum 2008) |

Table 1: List of potential full-reference IQ measures evaluated in this paper.

3.2  Monotonic  Correlation 

The standard Pearson correlation can be viewed as the square root of the
coefficient of determination ($R^2$) that is obtained by fitting a line
to the samples $(x_i, y_i)$ for $i = 1, \ldots, N_f$. Motivated by this
interpretation of the Pearson correlation, we define the monotonic
correlation (MC) as the $R^2$ value that is obtained by fitting an
arbitrary monotonic curve to the $(x_i, y_i)$ samples. To this end, the
samples are reindexed so that the values of $\mathbf{x}$ are in
ascending order, i.e., $x_1 \le x_2 \le \cdots \le x_{N_f}$. Then, the
monotonic fit is determined by selecting values $\hat{\mathbf{y}}$ that
are in either ascending or descending order such that the mean squared
difference between $\mathbf{y}$ and $\hat{\mathbf{y}}$ is minimized. The
monotonic fit can be found by solving two Quadratic Programming (QP)
problems,

$$
\begin{aligned}
\hat{\mathbf{y}}_\uparrow &= \arg\min_{\mathbf{z}} \|\mathbf{y}-\mathbf{z}\|_2^2
  \quad \text{s.t. } z_1 \le z_2 \le \cdots \le z_{N_f}, \\
\hat{\mathbf{y}}_\downarrow &= \arg\min_{\mathbf{z}} \|\mathbf{y}-\mathbf{z}\|_2^2
  \quad \text{s.t. } z_1 \ge z_2 \ge \cdots \ge z_{N_f}.
\end{aligned} \tag{2}
$$

Note that when some input values are equal, e.g.,
$x_i = x_{i+1} = \cdots = x_{i+k}$, the corresponding inequality
constraints become active, i.e., $z_i = z_{i+1} = \cdots = z_{i+k}$,
because the arbitrary monotonic function cannot produce more than one
output value for the same input value. Then, $\hat{\mathbf{y}}$ is the
$\hat{\mathbf{y}}_\uparrow$ or $\hat{\mathbf{y}}_\downarrow$ that leads
to the lowest residual error,

$$
\hat{\mathbf{y}} =
\begin{cases}
\hat{\mathbf{y}}_\uparrow & \text{if } \|\mathbf{y}-\hat{\mathbf{y}}_\uparrow\|_2 \le \|\mathbf{y}-\hat{\mathbf{y}}_\downarrow\|_2, \\
\hat{\mathbf{y}}_\downarrow & \text{otherwise.}
\end{cases} \tag{3}
$$
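The two constrained fits in (2) and the selection rule in (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`pav_increasing`, `pool_ties`, `monotonic_fit`) are hypothetical, and the decreasing fit is obtained here by negating the outputs and reusing the increasing fit. Outputs that share an input value are replaced by their mean before pooling, so that tied inputs receive a single fitted value.

```python
def pav_increasing(values):
    """Least-squares non-decreasing fit via pool adjacent violators."""
    blocks = []  # each block stores [sum of pooled values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Pool while the previous block mean exceeds the newest block mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


def pool_ties(xs, ys):
    """Replace outputs that share an input value by their mean (xs sorted)."""
    pooled = list(ys)
    start = 0
    while start < len(xs):
        stop = start
        while stop + 1 < len(xs) and xs[stop + 1] == xs[start]:
            stop += 1
        mean = sum(ys[start:stop + 1]) / (stop - start + 1)
        pooled[start:stop + 1] = [mean] * (stop - start + 1)
        start = stop + 1
    return pooled


def monotonic_fit(x, y):
    """Return (y sorted by x, fitted values, direction) following (2)-(3)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    pooled = pool_ties(xs, ys)
    up = pav_increasing(pooled)
    # Decreasing fit: negate, fit increasing, negate back.
    down = [-v for v in pav_increasing([-v for v in pooled])]

    def sse(fit):
        return sum((a - b) ** 2 for a, b in zip(ys, fit))

    if sse(up) <= sse(down):
        return ys, up, "increasing"
    return ys, down, "decreasing"


# One misordered pair (3, 2) is pooled into its mean 2.5.
print(monotonic_fit([1, 2, 3, 4], [1, 3, 2, 4]))
```

Residuals are measured against the original (sorted) outputs rather than the tie-averaged ones, so that tied inputs with different outputs still contribute to the fitting error.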


Finally, the $R^2$ value for the monotonic fit determines the MC

$$
\rho_{\text{mono}} = \pm\sqrt{1 - \frac{\|\mathbf{y}-\hat{\mathbf{y}}\|_2^2}{\sigma_y^2}}, \tag{4}
$$

where $\sigma_y^2$ is the sample variance of the values in $\mathbf{y}$
scaled by $N_f$, and the sign is positive by convention if
$\hat{\mathbf{y}}$ is ascending, i.e.,
$\hat{\mathbf{y}} = \hat{\mathbf{y}}_\uparrow$, and negative otherwise.
Alternatively, the MC can be computed via the Pearson correlation of
$\mathbf{y}$ and $\hat{\mathbf{y}}$. For purposes of integrating
likelihood ratios over disparate scenes (see next subsection), we define
$\rho_\uparrow$ (isotonic increasing correlation) or $\rho_\downarrow$
(isotonic decreasing correlation) by substituting
$\hat{\mathbf{y}}_\uparrow$ or $\hat{\mathbf{y}}_\downarrow$,
respectively, for $\hat{\mathbf{y}}$.

The heart of calculating $\rho_{\text{mono}}$ is solving the two QP
problems in (2). Because the function to minimize is convex and the
feasible region defined by the constraints is convex, there is a unique
minimum. Therefore, the optimal solution can be found without worrying
about the initial guess for $\hat{\mathbf{y}}$. In fact, these QP
problems are examples of the same well-known isotonic regression
problem, and the pool adjacent violators (PAV) algorithm can determine
the exact optimal values of $\hat{\mathbf{y}}_\uparrow$ and
$\hat{\mathbf{y}}_\downarrow$ in $N_f$ steps (Barlow et al. 1972; Hanson
et al. 1973). Moreover, it is shown in (Best and Chakravarti 1990;
Pardalos and Xue 1999) that an efficient coding of the PAV requires only
$O(N_f)$ operations.

Note that the PAV does not account for the active constraints. To force
the active constraints when $x_i = x_{i+1} = \cdots = x_{i+k}$, the
output values corresponding to equal input values are replaced by the
corresponding mean value, e.g.,
$y_j \leftarrow \frac{1}{k+1}\sum_{n=i}^{i+k} y_n$ for
$j = i, \ldots, i+k$, before entering the PAV. As shown in (Kaplan
et al. 2008a), this modified PAV will produce the optimal results.

The MC possesses many interesting properties. Like linear correlation,
it is invariant to linear transformation of the input and output
sequences. It is also invariant to any monotonic transformation of the
input sequence, because such a transformation does not change the
ordering of the elements used to solve (2). The MC is not invariant to
monotonic transformation of the output sequence. The calculation of the
model error places a higher penalty when the misordered values in the
output sequence have higher variance than when these values are tightly
clustered together. As a result, the MC is lower when ordering the input
leads to larger non-monotonic "swings" in the output sequence (see
Figure 2). Finally, it is not difficult to show that
$|\rho_{\text{lin}}| \le |\rho_{\text{mono}}|$ because any linear
function is monotonic and the monotonic fit will be at least as good as
the linear fit. Figure 3 shows examples of linear and monotonic fits to
a scatter plot of points $(x_i, y_i)$ and the corresponding correlation
values. The figure also demonstrates the fit of a logistic function to
the data. While the logistic function provides a better fit than the
linear one, it does not provide a good fit for the points whose $x$
value is greater than 0.95. This is due to the fact that the logistic
function cannot model two or more inflection points.
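As a small numeric check of the MC definition, the sketch below evaluates the signed square root of $R^2$ for a hand-worked example and compares it with the Pearson correlation of the outputs and their monotonic fit. The example values and the function names are illustrative assumptions, not data from the experiments. The output vector (1, 3, 2, 4) has the ascending fit (1, 2.5, 2.5, 4), obtained by pooling the single misordered pair. The scaled sample variance is taken here to mean the sum of squared deviations about the mean, which is an interpretation of "scaled by $N_f$".

```python
import math


def monotonic_correlation(y, y_hat, ascending=True):
    """Signed square root of R^2 for a given monotonic fit y_hat.

    Assumes y is not constant, so the total sum of squares is nonzero.
    """
    mean_y = sum(y) / len(y)
    sst = sum((v - mean_y) ** 2 for v in y)            # N_f times the variance
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # squared residual norm
    rho = math.sqrt(max(0.0, 1.0 - sse / sst))
    return rho if ascending else -rho


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


y = [1.0, 3.0, 2.0, 4.0]
y_hat = [1.0, 2.5, 2.5, 4.0]   # ascending fit: the pair (3, 2) is pooled

rho_mono = monotonic_correlation(y, y_hat)
rho_pearson = pearson(y, y_hat)
print(rho_mono, rho_pearson)   # both equal sqrt(0.9) up to rounding
```

Here the residual is 0.5 and the total sum of squares is 5, so $R^2 = 0.9$, and the two routes to the MC agree for this example.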


3.3  Hypothesis  Test 

This section connects the correlation analysis of the previous section
to a hypothesis test. The null hypothesis $H_0$ is that the IQ feature
is not monotonically related to human performance, and the $H_1$
hypothesis is that the monotonic relationship does exist. Under the null
hypothesis, the ground truth human performance is related to the actual
IQ feature via an arbitrary function $h(\cdot)$ so that, based on (1),

$$y_i = h(x_i) + n_i. \tag{5}$$

Likewise, for the $H_1$ hypothesis,

$$y_i = h_{\text{mono}}(x_i) + n_i. \tag{6}$$

Figure 2: Examples of monotonic fit and MC: (a) perfect fit
($\rho_{\text{mono}} = 1.0000$), (b) a single misordering between two
input features causing a large output "swing" lowers the correlation to
$\rho_{\text{mono}} = 0.9721$, (c) a single misordering between two
input features causing a small output "swing" only lowers the
correlation to $\rho_{\text{mono}} = 0.9999$, and (d) a cubic stretching
of the feature values in (b) does not change the fit or MC.

Figure 3: Examples of curve fits and correlation values: (a) linear
($\rho = 0.945275$), (b) logistic ($\rho = 0.981409$), and (c) monotonic
($\rho = 0.996767$).

It is well known that comparing the likelihood ratio between two simple
hypotheses (see footnote 1) to a threshold leads to the uniformly most
powerful (UMP) test. From (5), the likelihood for the null hypothesis is

$$
f(\mathbf{y} \mid H_0) = \frac{1}{(2\pi\sigma_n^2)^{N_f/2}}
\exp\left(-\frac{\sum_{i=1}^{N_f} \left(y_i - h(x_i)\right)^2}{2\sigma_n^2}\right), \tag{7}
$$

and from (6), the likelihood for $H_1$ is

$$
f(\mathbf{y} \mid H_1) = \frac{1}{(2\pi\sigma_n^2)^{N_f/2}}
\exp\left(-\frac{\sum_{i=1}^{N_f} \left(y_i - h_{\text{mono}}(x_i)\right)^2}{2\sigma_n^2}\right). \tag{8}
$$


The nonlinear functions $h_{\text{mono}}(\cdot)$ and $h(\cdot)$ are
unknown a priori, and it is not possible to compute the likelihood
ratio. Therefore, we resort to the generalized likelihood ratio test
(GLRT), where the unknown parameters in (7)-(8) are replaced by their ML
estimates. Actually, when computing the ML estimates of the nonlinear
functions, one only needs to consider the function values at $x_i$,
i.e., $g_i = h(x_i)$ (or $g_i = h_{\text{mono}}(x_i)$) under the $H_0$
(or $H_1$) hypothesis, for $i = 1, \ldots, N_f$. The maximization of (8)
is equivalent to the monotonic regression given by (2) and (3). On the
other hand, the maximization of (7) trivially selects $h(\cdot)$ so that
$g_i = y_i$ for $i = 1, \ldots, N_f$. Therefore, the generalized
likelihood ratio (GLR) is


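$$
\Lambda = \exp\left(-\frac{\|\mathbf{y}-\hat{\mathbf{y}}\|_2^2}{2\sigma_n^2}\right), \tag{9}
$$

and given (4), the relationship between the GLR and the MC is derived to
be

$$
\Lambda = \exp\left(-\frac{\sigma_y^2}{2\sigma_n^2}\left(1-\rho_{\text{mono}}^2\right)\right). \tag{10}
$$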
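For completeness, the intermediate step linking the residual form of the GLR to its MC form can be written out as below. This is only a sketch of the algebra, using the symbols defined in the text ($\hat{\mathbf{y}}$ for the monotonic fit, $\sigma_y^2$ for the scaled sample variance of $\mathbf{y}$, and $\sigma_n^2$ for the measurement noise variance).

```latex
% Squaring the MC definition and solving for the residual norm:
\rho_{\text{mono}}^2 = 1 - \frac{\|\mathbf{y}-\hat{\mathbf{y}}\|_2^2}{\sigma_y^2}
\quad\Longrightarrow\quad
\|\mathbf{y}-\hat{\mathbf{y}}\|_2^2 = \sigma_y^2\left(1-\rho_{\text{mono}}^2\right).

% Substituting into the exponent of the GLR:
-\frac{\|\mathbf{y}-\hat{\mathbf{y}}\|_2^2}{2\sigma_n^2}
= -\frac{\sigma_y^2}{2\sigma_n^2}\left(1-\rho_{\text{mono}}^2\right).
```

The sign of the MC drops out because only its square enters the exponent.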


Note that $\sigma_y^2$ represents the inter-fusion method spread of the
perception scores. In a similar vein, $\sigma_n^2$ represents the
intra-fusion method spread of the perception scores, and the ratio

$$K = \frac{\sigma_y^2}{\sigma_n^2} \tag{11}$$


is analogous to a class separability criterion used in discriminant
analysis (Fukunaga 1990). We refer to this ratio as the separability
ratio in the sequel. Since the magnitude of the correlation is bounded
by zero and one, $\Lambda$ can take on values from
$\exp\left(-\tfrac{1}{2}K\right)$ to one. The GLR is always less than or
equal to one because the arbitrary $H_0$ fit can never be worse than the
monotonic fit. GLR values close to one indicate a high likelihood that
the relationship between $\mathbf{x}$ and $\mathbf{y}$ is monotonic.
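The dependence of the GLR on the MC and the separability ratio is easy to explore numerically. The following minimal sketch evaluates $\Lambda = \exp\left(-\tfrac{1}{2}K(1-\rho_{\text{mono}}^2)\right)$; the function name `glr_from_mc` and the sample values of $K$ are illustrative assumptions.

```python
import math


def glr_from_mc(rho_mono, k):
    """GLR as a function of the MC and the separability ratio K."""
    return math.exp(-0.5 * k * (1.0 - rho_mono ** 2))


# A perfect monotonic fit gives a GLR of one for any K.
print(glr_from_mc(1.0, 25.0))
# With no monotonic structure at all, the GLR drops to exp(-K / 2).
print(glr_from_mc(0.0, 4.0), math.exp(-2.0))
# When K = 0 the GLR is one regardless of the MC.
print(glr_from_mc(0.3, 0.0))
```

The last case illustrates why a GLR near one is not informative on its own: it can arise either from a high MC or from a small separability ratio.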

Footnote 1: A simple hypothesis is one with a completely known
likelihood function, i.e., one that has no unknown parameters.

When the likelihood ratio is much greater than one, there is compelling
evidence that the $H_1$ hypothesis is true. In other words, the support
for the $H_1$ hypothesis is statistically significant. Because a high MC
leads to a GLR value close to one no matter the separability ratio, it
is difficult to determine the statistical significance of the large
value. Actually, the significance of the MC is intertwined with the
spread of possible GLR values under the null hypothesis. Note that the
input $\mathbf{x}$ influences the MC strictly by how it sorts the output
$\mathbf{y}$. Under the null hypothesis, the ranking of the perception
results is unrelated to the ranking of the feature values. Therefore,
the GLR value could result from any arbitrary ordering of the perception
values, i.e., the index $i$ for $y_i$ is arbitrary. Thus, to gain
insight about the significance of the GLR value (based on a particular
$\mathbf{x}$), it is instructive to take the ratio of $\Lambda$ over the
expected value of $\Lambda$ given random input feature values drawn from
an uninformative prior, which we label as $\bar{\Lambda}$. Under an
uninformative prior for the input $\mathbf{x}$, the sorting of
$\mathbf{y}$ is one of $N_f!$ possibilities with equal probability. For
small $N_f$, $\bar{\Lambda}$ can be computed by averaging over all
$N_f!$ possible values of $\Lambda$, but when $N_f$ is large, one must
resort to averaging over Monte Carlo trials. We define the normalized
GLR as

$$\tilde{\Lambda} = \frac{\Lambda}{\bar{\Lambda}} \tag{12}$$

so that $\tilde{\Lambda}$ can exceed one. When the separability ratio is
$K = 0$, the normalized GLR is always one, $\tilde{\Lambda} = 1$, and
even when the MC is close to one, there is no compelling evidence to
support the $H_1$ hypothesis. When $|\rho_{\text{mono}}| = 1$ and
$K > 0$, $\tilde{\Lambda} > 1$. As $K$ becomes larger, so does
$\tilde{\Lambda}$, and the evidence to support the $H_1$ hypothesis
becomes more significant. For values of $|\rho_{\text{mono}}|$ near one,
$\tilde{\Lambda}$ increases as $K$ becomes larger than zero. However, as
$K$ approaches infinity, $\tilde{\Lambda}$ goes down to zero. In other
words, when the spread is zero, no conclusions can be drawn from the
results. As the spread goes to infinity, there is no measurement error
and $(\mathbf{x}, \mathbf{y})$ must trace out a monotonic curve. In
between, a $|\rho_{\text{mono}}|$ near one may be significant.
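For a handful of fusion methods, the average GLR over all orderings can be computed exactly by enumeration. The sketch below is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the input values are assumed to be distinct (so no tie handling is needed), and the output vector passed in is assumed to be already sorted by the input feature.

```python
import math
from itertools import permutations


def pav_increasing(values):
    """Least-squares non-decreasing fit via pool adjacent violators."""
    blocks = []  # each block stores [sum of pooled values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit


def glr(y, sigma_n2):
    """GLR for outputs y listed in order of increasing input feature."""
    up = pav_increasing(y)
    down = [-v for v in pav_increasing([-v for v in y])]
    sse_up = sum((a - b) ** 2 for a, b in zip(y, up))
    sse_down = sum((a - b) ** 2 for a, b in zip(y, down))
    return math.exp(-min(sse_up, sse_down) / (2.0 * sigma_n2))


def normalized_glr(y_sorted_by_x, sigma_n2):
    """GLR divided by its average over all N_f! orderings of the outputs."""
    observed = glr(y_sorted_by_x, sigma_n2)
    values = [glr(p, sigma_n2) for p in permutations(y_sorted_by_x)]
    return observed / (sum(values) / len(values))


# Six perfectly ordered scores: the observed GLR is one and the average is not.
print(normalized_glr([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 1.0))
# Zero spread between methods (K = 0): every ordering gives a GLR of one.
print(normalized_glr([3.0, 3.0, 3.0, 3.0], 1.0))
```

With six fusion methods the enumeration covers only 720 orderings; for much larger $N_f$ the list of permutations would be replaced by random shuffles, as the text notes.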

When performing monotonic analysis over multiple scenes, the sign of the
correlation should be consistent from one scene to the next. Otherwise,
it is impossible to determine if a higher feature score translates to
high or low quality. As a result, one should consider the isotonically
increasing GLR and isotonically decreasing GLR by substituting
$\rho_\uparrow$ or $\rho_\downarrow$, respectively, for
$\rho_{\text{mono}}$ in (10). Then, if $\tilde{\Lambda}_{\uparrow,s}$
and $\tilde{\Lambda}_{\downarrow,s}$ are the normalized isotonic
likelihoods for the $s$-th scene, the overall normalized GLR for all
$N_s$ scenes is

$$
\tilde{\Lambda} = \max\left\{ \prod_{s=1}^{N_s} \tilde{\Lambda}_{\uparrow,s},\;
\prod_{s=1}^{N_s} \tilde{\Lambda}_{\downarrow,s} \right\}. \tag{13}
$$
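Because the per-scene values are multiplied over all scenes, the overall normalized GLR can span many orders of magnitude, and it is convenient to accumulate the products in the log domain. The sketch below assumes that the combination is the larger of the two per-direction products and that all per-scene values are strictly positive. The function name and the numbers in the usage line are made up for illustration.

```python
import math


def overall_normalized_glr(up_by_scene, down_by_scene):
    """Larger of the increasing and decreasing products over all scenes."""
    log_up = sum(math.log(v) for v in up_by_scene)
    log_down = sum(math.log(v) for v in down_by_scene)
    return math.exp(max(log_up, log_down))


# Two hypothetical scenes: the increasing direction dominates (2 * 3 = 6).
print(overall_normalized_glr([2.0, 3.0], [0.5, 0.1]))
```

Forcing a single direction across scenes is what penalizes a measure whose correlation flips sign from scene to scene: a scene that fits well only in the losing direction contributes a small factor to the winning product.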


4  Data  Analysis 


The monotonic analysis described in Section 3 was used to evaluate the
results of the perceptual experiment described in Section 2. Table 2
summarizes statistics about $\rho_{\text{mono}}$ and the per-scene
normalized GLR $\tilde{\Lambda}_s$ over each of the scenes. For every IQ
measure, there is at least one scene where the monotonic fit between the
measure values and the MOS is very good. However, on average the
monotonic fit is only high for the Qd and QE IQ measures. Furthermore,
the MCs for the Qd and QE measures are positive for all scenes.
Likewise, the MC for the MI measure is negative for all scenes. When the
normalized GLR threshold is four, the Qd measure is significant for the
most scenes, followed by QE. In fact, for all but four scenes, the
normalized GLRs for Qd and QE exceed one. Finally, the overall
normalized GLR $\tilde{\Lambda}$ was computed by (13) and is included in
Table 2. Overall, the 28 scenes support the conclusion that Qd and QE
exhibit a monotonic relationship with human perceptual quality. The
monotonic analysis does not support the other four measures as good IQ
measures. The relative ranking of the measures via the overall
normalized GLR is consistent with the scoring mechanisms presented in
(Chen and Blum 2008), with the exception that MI is third as opposed to
sixth. This is due to the fact that the analysis in this paper accepts
monotonically decreasing relationships, since the measure can be
transformed into an IQ measure by taking the reciprocal. Nevertheless,
the overall normalized GLR score for MI is well below a value of one.
Finally, none of the feature scoring mechanisms in (Chen and Blum 2008)
indicate the extent of the support for the hypothesis that a proposed
feature is a good IQ measure.

| Statistic | Qm | Qb | Qd | QE | UI | MI |
|-----------|----|----|----|----|----|----|
| max $\lvert\rho_{\text{mono}}\rvert$ | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 |
| min $\lvert\rho_{\text{mono}}\rvert$ | 0.369 | 0.466 | 0.687 | 0.762 | 0.375 | 0.598 |
| mean $\rho_{\text{mono}}$ | 0.586 | 0.662 | 0.955 | 0.943 | 0.572 | -0.861 |
| # $\rho_{\text{mono}} > 0$ | 23 | 25 | 28 | 28 | 24 | 0 |
| # $\tilde{\Lambda}_s > 4$ | 13 | 9 | 20 | 15 | 2 | 6 |
| # $\tilde{\Lambda}_s > 1$ | 18 | 11 | 24 | 24 | 6 | 16 |
| $\tilde{\Lambda}$ | 4.282e-22 | 5.227e-24 | 4.924e+22 | 5.833e+19 | 8.264e-44 | 7.221e-03 |

Table 2: Statistics describing the significance of the perception results via the monotonic analysis.
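The ranking discussed above can be reproduced directly from the last row of Table 2. The short sketch below sorts the six measures by their overall normalized GLR and flags those whose value exceeds one. Using one as the cut-off is a reading of the discussion in the text, not a threshold prescribed by the analysis.

```python
# Overall normalized GLR per measure, copied from the last row of Table 2.
overall_glr = {
    "Qm": 4.282e-22,
    "Qb": 5.227e-24,
    "Qd": 4.924e+22,
    "QE": 5.833e+19,
    "UI": 8.264e-44,
    "MI": 7.221e-03,
}

# Rank from strongest to weakest support for a monotonic relationship.
ranking = sorted(overall_glr, key=overall_glr.get, reverse=True)
supported = [name for name in ranking if overall_glr[name] > 1.0]

print(ranking)    # ['Qd', 'QE', 'MI', 'Qm', 'Qb', 'UI']
print(supported)  # ['Qd', 'QE']
```

MI lands in third place, yet its value remains below one, which matches the observation that its third-place rank does not translate into support as a good IQ measure.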

5  Conclusions 

This work provides a novel analysis to measure the suitability of
proposed IQ measures using results from human perceptual experiments.
The basis of this analysis is the use of an MC statistic that determines
to what degree a monotonic relationship exists between a proposed IQ
measure and the human perceptual score. As demonstrated in this paper,
the MC is more general than linear and logistic correlations. This work
also shows the connection between the MC and a hypothesis test
attempting to decide if the proposed IQ measures exhibit a monotonic
relationship with human perception performance. Finally, the paper
introduces the concept of the normalized GLR to evaluate the statistical
significance of the MC, or corresponding GLR value, in light of any
random ordering of the human perception results. The monotonic analysis
was used to evaluate six IQ measures. The analysis reveals the
effectiveness of the objective measure (QE) and the HVS quality index
using the difference of Gaussian contrast sensitivity filter (Qd).